Posterior predictive distributions and checks
Introduction
Let’s describe a general model for a data generating process as \(y \sim p(y | \theta)\), with data \(y\) generated from the family of probability distributions \(p\) indexed with parameter \(\theta\).
There is no doubt that there is uncertainty in \(y\) since it is a random variable, by construction, and thus can take on a range of values based on its probability distribution.
In the Bayesian setup, we model uncertainty over \(\theta\) by assuming it comes from its own distribution, with prior \(p(\theta)\). So our full data model is:
\[
y \sim p(y | \theta); \\
\theta \sim p(\theta)
\]
Next we fit our model (using some method or another) to data using Bayes’ theorem to get the posterior distribution of \(\theta\) given the data, \(p(\theta | y) \propto p(y | \theta) p(\theta)\) - i.e., the posterior distribution is proportional to the likelihood times the prior (up to the normalizing denominator, \(p(y)\)).
This gives us an updated probability distribution over possible values for \(\theta\).
Next, it is common to sample from the posterior predictive distribution, where the idea is to generate new observations of \(y\) by simulating the data generating process that we defined above - where \(\theta \sim p(\theta | y)\) and \(y \sim p(y | \theta)\).
This process incorporates uncertainty from two different sources:
- Over the parameter values, i.e., the uncertainty in \(\theta\), described by posterior \(p(\theta | y)\)
- And the uncertainty in the sampling distribution itself, i.e., the uncertainty that comes from the data being generated by \(p(y | \theta)\).
So, while it may be tempting to plug in a single value of \(\theta\), say a \(\hat{\theta}\), into the sampling distribution to generate new values of \(y\), this would ignore the uncertainty in \(\theta\) and cause the predictive distribution to be too narrow.
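This narrowing is easy to demonstrate with a quick simulation (the "posterior" over the rate here is made up purely for illustration): the plug-in predictive carries only the sampling variance, while the full posterior predictive adds the variance of the parameter draws on top.

```r
set.seed(1)
n <- 1e5

# Plug-in predictive: fix the Poisson rate at a point estimate
y_plugin <- rpois(n, lambda = 3)

# Posterior predictive: draw a fresh rate from a (made-up) posterior for
# every replicate, then draw y from the implied sampling distribution
lambda_draws <- abs(rnorm(n, mean = 3, sd = 1))
y_post <- rpois(n, lambda = lambda_draws)

# Both are centered near 3, but the plug-in version is too narrow:
# Var(y_plugin) is about 3, Var(y_post) about 3 + Var(lambda) = 4
c(var(y_plugin), var(y_post))
```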
These two sources of uncertainty are sometimes referred to as epistemic and aleatoric uncertainty. Epistemic uncertainty arises from \(\theta\) being unknown and estimated from a finite sample of data; aleatoric uncertainty arises because even if we knew \(\theta\) exactly, there would still be uncertainty in the values of \(y\) (by definition, since \(y\) is a random variable, and so there is some missing information about its generation).

> Aleatoric uncertainty is sometimes called stochastic uncertainty, and its name derives from the Latin for gambling (alea, a dice game). Aleatoric uncertainty represents the unknowns that differ each time we run an experiment, because we do not have sufficient knowledge about all the factors influencing the data - sometimes referred to as intrinsic randomness.

> Epistemology is the study of knowledge, and epistemic uncertainty - or systematic uncertainty - indicates uncertainty due to a lack of knowledge about something that we could in principle know, perhaps due to a lack of data or measurement error.
An obvious reason to do posterior predictive sampling is if we are interested in making probabilistic predictions of \(y\), which, again, should incorporate uncertainty in the parameter \(\theta\) and in the sampling distribution \(p(y | \theta)\).
But even if we are primarily interested in \(\theta\), it is still important to analyze the posterior predictive distribution because it indicates the fit of the model - if we can’t use our model to generate a simulated distribution of \(y\) that looks and behaves something like the actual distribution of \(y\) (in our observed data), then our model is not doing well.
In practice, this distribution is going to be difficult or impossible to calculate analytically, but we can estimate it using sampling.
This is essentially a two-step process, repeated many times:
- We sample a single \(\theta\) from its posterior distribution, \(\theta_i \sim p(\theta | y)\), which implies a sampling distribution for \(y\), such that \(y \sim p(y | \theta_i)\).
- Then we sample a single draw from this sampling distribution, which we’ll call \(y^{rep}_i\).
- We refer to these draws as \(y^{rep}\) since if we had covariates, we’d be simulating outcome values over the observed values of the covariates to perform this check - i.e., directly replicating the theoretical process that created the data sample \(y\). We could also simulate \(y\) for new covariate values if we were interested in prediction, in which case we can still use this process to generate a posterior predictive distribution, but will generally use different notation (often \(\tilde{y}\)).
By repeating many times, we build up a collection of \(y^{rep}\) values that each came from different variations of \(\theta_i\) (each implying variations of sampling distribution \(p(y | \theta_i)\)), and this collection of samples approximates the posterior predictive distribution: \(p(y^{rep} | y)\)
Formally the posterior predictive distribution can be defined as:
\[ p(y^{rep} | y) = \int p(y^{rep} | \theta) p(\theta | y) d\theta \]
The product under the integral reduces to the joint posterior density, \(p(y^{rep}, \theta | y)\) (since \(p(a,b) = p(a|b)p(b)\)), so integrating over \(\theta\) just marginalizes out the parameters, leaving the predictive density of the data.
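This integral can be checked by simulation. A minimal sketch, assuming (for illustration only) a conjugate Gamma prior on a Poisson rate so that the posterior \(p(\lambda | y)\) has a known closed form: the two-step sampler and the Monte Carlo average of the densities \(p(y^{rep} | \theta_i)\) produce the same distribution.

```r
set.seed(42)
y <- rpois(20, lambda = 3)   # pretend observed data
a <- 1; b <- 1               # Gamma(a, b) prior hyperparameters
S <- 5000

# Step 1: sample lambda_i ~ p(lambda | y) = Gamma(a + sum(y), b + n)
lambda_post <- rgamma(S, shape = a + sum(y), rate = b + length(y))

# Step 2: sample y_rep_i ~ p(y | lambda_i)
y_rep <- rpois(S, lambda = lambda_post)

# The integral, approximated by averaging the Poisson pmf over the
# posterior draws, matches the empirical distribution of y_rep
pred_density <- sapply(0:12, function(k) mean(dpois(k, lambda_post)))
empirical    <- tabulate(y_rep + 1, nbins = 13) / S   # proportions of 0..12
round(rbind(empirical, pred_density), 3)
```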
Example 1: Poisson model
For a simple example, consider a Poisson distribution with \(\theta = \lambda\).
We can imagine the extreme case where we are very certain about \(\lambda\) (so minimal epistemic uncertainty), so that the distribution over \(\lambda\) spikes around its likely value, and samples from the distribution, \(\lambda_i\), will be more-or-less identical during each iteration - practically fixed. Therefore, the implied sampling probability distribution \(p(y | \lambda_i)\) will also not really change from iteration to iteration, and the values of \(y^{rep}\) sampled will essentially be governed by the same sampling distribution at each iteration. Thus, after repeating the generative process many times we’d expect the distribution of \(y^{rep}\) values to closely match that of the practically-fixed sampling distribution.
In fact, if we were absolutely certain about the value of \(\lambda\) (zero epistemic uncertainty), say we knew \(\lambda=3\), then we could generate \(y^{rep}\) with a simple random number generator (though this is not Bayesian).
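In R, that degenerate case is a single call to the Poisson random number generator:

```r
rpois(1000, lambda = 3)
```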
But imagine that there is a small amount of uncertainty in \(\lambda\) which we model with a probability distribution that spikes around the value \(3\), as shown by the first histogram below. Then each time we sample a \(\lambda_i\), the implied sampling distribution will not change much, as shown by the random PMFs in the middle figure.
As a result, accumulating \(y^{rep}\) values from this process yields a distribution that closely resembles \(Poisson(3)\), as seen in the bottom figure.
Posterior predictive distribution under minimal epistemic uncertainty
n <- 1000
# p(\lambda | y)
l_rep <- abs(rnorm(n, mean = 3, sd = 0.05))
hist(l_rep, main = expression("p(" ~ lambda ~ " | y)"), xlab = "")
# p(y | \lambda)
par(mfrow = c(2, 3), oma = c(6,0,2,0))
xp <- seq(0, 10, by = 1)
for (. in 1:6) {
.l <- sample(l_rep, 1)
plot(xp, dpois(xp, lambda = .l),
ylab = "", xlab = "",
main = paste("lambda = ", round(.l, 2)),
pch = 19)
}
title(main = expression("p(" ~ "y" ~ "|" ~ lambda[i] ~ ")"),
sub = "PMFs implied by each sampled lambda",
outer = TRUE)
# p(y^{rep} | y)
y_rep <- rpois(n, lambda = l_rep)
hist(y_rep, probability = TRUE, main = expression("p(" ~ "y"^"rep" ~ "| y)"))
lines(density(y_rep, bw = 0.5), col = "red")
text(mean(y_rep), 0.,
paste("mean =", round(mean(y_rep), 2)),
pos = 3, col = "blue")
However, if there is lots of uncertainty around \(\lambda\) so that its own distribution is relatively spread out, then each time we sample a \(\lambda_i\) it may imply a very different sampling distribution \(p(y | \lambda_i)\). Each draw of \(y^{rep}\) could then be coming from a very different distribution, resulting in a posterior predictive distribution \(p(y^{rep}|y)\) that incorporates much more uncertainty and may not closely resemble any single sampling distribution.
In the figures below, the only tweak is that, for the fake posterior distribution of \(\lambda\), the standard deviation is increased from \(0.05\) to \(2\), making the distribution much more spread out. This leads to a much wider variety of sampled \(\lambda_i\)'s.
The resulting posterior predictive distribution is still centered around \(3\) but is significantly more spread out relative to the above example, reflecting greater uncertainty.
Posterior predictive distribution under more epistemic uncertainty
n <- 1000
# p(\lambda | y)
l_rep <- abs(rnorm(n, mean = 3, sd = 2))
hist(l_rep, main = expression("p(" ~ lambda ~ " | y)"), xlab = "")
# p(y | \lambda)
par(mfrow = c(2, 3), oma = c(6,0,2,0))
xp <- seq(0, 10, by = 1)
for (. in 1:6) {
.l <- sample(l_rep, 1)
plot(xp, dpois(xp, lambda = .l),
ylab = "", xlab = "",
main = paste("lambda = ", round(.l, 2)),
pch = 19)
}
title(main = expression("p(" ~ "y" ~ "|" ~ lambda[i] ~ ")"),
sub = "PMFs implied by each sampled lambda",
outer = TRUE)
# p(y^{rep} | y)
y_rep <- rpois(n, lambda = l_rep)
hist(y_rep, probability = TRUE, main = expression("p(" ~ "y"^"rep" ~ "| y)"))
lines(density(y_rep, bw = 0.5), col = "red")
text(mean(y_rep), 0.,
paste("mean =", round(mean(y_rep), 2)),
pos = 3, col = "blue")
Again, the idea of posterior predictive checks is to confirm that our model is able to generate data that looks something like the observed data.
The first level is simply to compare histograms of your \(y^{rep}\) values to the observed data visually. But once you can calculate \(y^{rep}\) values, you can do much more interesting things, like estimate the distribution of differences between the posterior predictive mean and the observed data mean, estimate event probabilities, etc.
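As a sketch of those numerical checks (the observed data and the stand-in posterior below are made up; in practice the \(y^{rep}\) draws would come from your fitted model):

```r
set.seed(7)
y_obs <- rpois(100, lambda = 3)                      # pretend observed data
lambda_draws <- abs(rnorm(1000, mean = 3, sd = 0.5)) # stand-in posterior draws

# One row of replicated data per posterior draw
y_rep <- matrix(rpois(1000 * 100, rep(lambda_draws, each = 100)),
                nrow = 1000, byrow = TRUE)

# Distribution of differences between replicated means and the observed mean
mean_diff <- rowMeans(y_rep) - mean(y_obs)
quantile(mean_diff, c(0.025, 0.5, 0.975))

# Posterior predictive p-value for a tail statistic (here the maximum);
# values near 0 or 1 suggest the model cannot reproduce that feature
mean(apply(y_rep, 1, max) >= max(y_obs))
```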
See the Stan user manual for more ideas.
I found it helpful to see this process play out dynamically in Ben Lambert’s video:
https://www.youtube.com/watch?v=TMnXQ6G6E5Y&ab_channel=BenLambert
Example 2: Linear regression
As a slightly more complicated example, imagine \(y \sim \text{Normal}(\alpha + x \beta, \sigma_{\epsilon}^2)\), so \(\theta = (\alpha, \beta, \sigma_{\epsilon})\), which could also be written \(y = \alpha + x \beta + \epsilon\) where \(\epsilon \sim N(0, \sigma_\epsilon^2)\).
Here we would define priors over \(\alpha\), \(\beta\), and \(\sigma_{\epsilon}\), and then fit the model to some data to get posterior distributions over these parameters.
The pretend posterior distributions are specified below.
To generate the posterior predictive distribution, we first simulate a value for each of these parameters:
# p(\theta | y) ## after fitting, pretend these are the posterior distributions
alpha_rep <- rnorm(1, mean = -0.95, sd = 0.5)
beta_rep <- rnorm(1, mean = 3.1, sd = 0.16)
sig_rep <- abs(rnorm(1, mean = 2.1, sd = 0.36))
cat("THETA:\n",
" alpha: ", alpha_rep, "\n",
" beta: ", beta_rep, "\n",
" sigma_eps: ", sig_rep, "\n")
THETA:
alpha: -1.58197
beta: 2.72268
sigma_eps: 1.940254
Next we fix these parameter values ("condition on" them), which defines a probability distribution (from the Normal family) at each value of \(x\). For example, here are the distributions at some different values of \(x\):
# p(y | \theta_i)
par(mfrow = c(3,3))
for (x in seq(-3, 3, length.out = 9)) {
quantiles <- seq(-20, 20, length.out = 1000)
pdf <- dnorm(quantiles, mean = alpha_rep + beta_rep * x, sd = sig_rep)
plot(quantiles, pdf, type = "l", xlab = "", ylab = "density", main = paste("x =", x))
}
Next we would repeat this process over and over to generate a collection of \(y^{rep}\) values, which would give us a sense of the posterior predictive distribution \(p(y^{rep} | y)\).
For example, for \(x = 3\):
# p(y^{rep} | y)
y_rep <- numeric(1000)
for (i in 1:length(y_rep)) {
alpha_rep <- rnorm(1, mean = -0.95, sd = 0.5)
beta_rep <- rnorm(1, mean = 3.1, sd = 0.16)
sig_rep <- abs(rnorm(1, mean = 2.1, sd = 0.36))
y_rep[i] <- rnorm(1, mean = alpha_rep + beta_rep * 3, sd = sig_rep)
}
hist(y_rep, breaks = 20, xlim = c(-20, 20),
main = expression("p(" ~ "y"^"rep" ~ "| y); x = 3"))
It looks quite a bit like the PDF seen for \(x = 3\) in the previous figure, because the fake posterior distributions for \(\alpha\), \(\beta\), and \(\sigma_{\epsilon}\) have low variance (so not much epistemic uncertainty).
If we jack up the variance in the parameters' posterior distributions:
y_rep <- numeric(1000)
for (i in 1:length(y_rep)) {
alpha_rep <- rnorm(1, mean = -0.95, sd = 2)
beta_rep <- rnorm(1, mean = 3.1, sd = 2)
sig_rep <- abs(rnorm(1, mean = 2.1, sd = 1))
y_rep[i] <- rnorm(1, mean = alpha_rep + beta_rep * 3, sd = sig_rep)
}
hist(y_rep, breaks = 20, xlim = c(-20, 20),
main = expression("p(" ~ "y"^"rep" ~ "| y); x = 3"),
sub = "Increased posterior uncertainty")
In general, we don't have to condition on an \(x\) value while analyzing the posterior predictive distribution. We could just look at the entire distribution of \(y^{rep}\) values and compare it with the entire distribution of observed \(y\).
For example, pretend we have the below data:
sim <- within(list(), {
n <- 50
x <- seq(-3, 3, length.out = n)
alpha <- -1
beta <- 3
sigma <- 2
y <- rnorm(n, mean = alpha + beta * x, sd = sigma)
})
plot(sim$x, sim$y)
And suppose we've already produced posterior distributions for \(\alpha\), \(\beta\), and \(\sigma_{\epsilon}\), so we can generate \(y^{rep}\) values for each of the \(x\) values and compare each replication with the actual distribution of \(y\):
y_rep <- matrix(nrow = length(sim$x),
ncol = 100) ## number of replications
for (i in seq(1, ncol(y_rep))) {
alpha_rep <- rnorm(1, mean = -0.95, sd = 0.5)
beta_rep <- rnorm(1, mean = 3.1, sd = 0.16)
sig_rep <- abs(rnorm(1, mean = 2.1, sd = 0.36))
y_rep[, i] <- rnorm(length(sim$x),
mean = alpha_rep + beta_rep * sim$x,
sd = sig_rep)
}
layout(matrix(c(0,1,1,0,0, 0,1,1,0,0, 2,3,4,5,6, 7,8,9,10,11),
nrow = 4, byrow = TRUE))
hist(sim$y, main = "y (observed)", xlab = "")
for (i in seq(1, 10)) {
hist(y_rep[, i], main = paste("Replication", i), xlab = "", ylab = "")
}
Or as a scatter plot:
y_mean <- rowMeans(y_rep)
# 2.5% and 97.5% quantiles give an equal-tailed 95% interval
# (not strictly a highest-density interval)
y_int <- t(apply(y_rep, 1, quantile, probs = c(0.025, 0.975)))
plot(sim$x, sim$y, col = "blue", pch = 16)
points(sim$x, y_mean, col = "red", pch = 1)
segments(sim$x, y_int[, 1], sim$x, y_int[, 2], col = "red", lty = 2)
legend("topleft",
legend = c("Observed", "Predicted Mean", "95% Interval"),
col = c("blue", "red", "red"),
pch = c(16, 1, NA),
lty = c(NA, NA, 2))
In Stan, we can sample from the posterior predictive distribution by laying out the process described above in the generated quantities block.
For example:
...
generated quantities {
vector[N] y_rep;
for (i in 1:N) {
y_rep[i] = normal_rng(alpha + beta * x[i], sigma);
}
}
During each iteration of MCMC, after Stan has sampled a draw for each parameter \(\alpha\), \(\beta\), and \(\sigma\), it runs the generated quantities block using the sampled values to generate a \(y^{rep}\) value for each \(x\) value (\(x\) would be defined in the data block and the parameters in the parameters block).
Prior predictive checks
We simulate from the posterior predictive distribution of our model when we want to generate probabilistic predictions of the data, and also to check the fit of the model.
Likewise, we can simulate the model’s predictions of the data before it sees the data, and this is a prior predictive check.
The objective here is to assess whether our priors are appropriate. If we have a single parameter and thus one prior, this might not seem too important, but often we have many parameters or even hierarchical parameters, so it is important to see what the priors you set on the parameters imply about the prior distribution of the data.
In practice, this goes exactly the same as a posterior predictive check, except with no data. We simulate parameter values from each parameter’s prior distribution and feed those into the data’s sampling distribution to generate prior predictions of the data. Then we can plot simulations of the data or compute test statistics using this distribution.
Formally we write:
\[ y^{sim} \sim p(y) \]
where \(p(y) = \int p(y | \theta) p(\theta) d\theta\) is the prior predictive (marginal) distribution of the data.
We can simulate from the prior predictive distribution by generating parameter values from their prior distributions and using them to generate y values from the sampling distribution.
\[ \theta^{sim} \sim p(\theta) \] \[ y^{sim} \sim p(y | \theta^{sim}) \]
This procedure really produces draws from the joint distribution \(p(y, \theta)\) (probability of \(y\) and \(\theta\) values occurring together), but we obtain the marginal distribution \(p(y^{sim})\) by ignoring \(\theta^{sim}\), so that \(y^{sim} \sim p(y)\).
For the linear regression example, we can simulate the prior predictive distribution in R. We define some priors for parameters \(\alpha\), \(\beta\), and \(\sigma_{\epsilon}\), and then simulate from them.
# just for illustration, define functions that encode priors
alpha_prior <- function() rnorm(1, mean = 0, sd = 5)
beta_prior <- function() rnorm(1, mean = 0, sd = 5)
sig_prior <- function() abs(rnorm(1, mean = 3, sd = 2))
# simulate from priors to observe distributions
prior_params <- function() {
par(mfrow = c(2, 2))
# alpha
as <- replicate(1000, alpha_prior())
hist(as, main = expression(alpha), xlab = "")
points(x = sim$alpha, y = 0, pch = 13, col = "blue")
# beta
bs <- replicate(1000, beta_prior())
hist(bs, main = expression(beta), xlab = "")
points(x = sim$beta, y = 0, pch = 13, col = "blue")
# sigma
ss <- replicate(1000, sig_prior())
hist(ss, main = expression(sigma[epsilon]), xlab = "")
points(x = sim$sigma, y = 0, pch = 13, col = "blue")
#
plot(1, type="n", xaxt = "n", yaxt = "n",
xlab="", ylab="", xlim=c(0, 10), ylim=c(0, 10))
legend("center", legend = "True parameter (unknown)",
col = "blue", pch = 13)
}
prior_params()
We can then simulate the data from the prior predictive distribution using the priors we defined. Here I've plotted the prior predictive distribution for several of the \(x\) values. Comparing them with the actual \(y\) values (in red) from our data, we can see that our priors are quite diffuse and very uncertain about the location of the data (due to the large SDs).
prior_check_dists <- function() {
ys <- matrix(nrow = length(sim$x), ncol = 1000)
for (i in seq(1, ncol(ys))) {
alpha_sim <- alpha_prior()
beta_sim <- beta_prior()
sig_sim <- sig_prior()
ys[, i] <- rnorm(length(sim$x),
mean = alpha_sim + beta_sim * sim$x,
sd = sig_sim)
}
par(mfrow = c(3, 3), xpd = TRUE)
plot.new()
legend("topright",
legend = c("Data", "Sim"), col = c("red", "grey30"), pch = c("|", "|"))
for (i in seq(1, 8)) {
hist(ys[i, ], main = paste("y[", i, "]"), xlab = "")
points(x = sim$y[i], y = 0, pch = "|", col = "red", cex = 2)
}
par(mfrow = c(1, 1))
}
prior_check_dists()
We could also examine the distribution of possible regression lines generated from our priors (this of course ignores the variation from \(\sigma_\epsilon\)).
# simulate and plot lines from the prior distributions, compare with data
prior_check_lines <- function() {
plot(NULL, xlim = range(sim$x) * 1.3, ylim = range(sim$y) * 1.3,
xlab = "x", ylab = "y")
for (i in seq(1, 1000)) {
alpha_sim <- alpha_prior()
beta_sim <- beta_prior()
abline(a = alpha_sim, b = beta_sim,
col = rgb(0.5, 0, 0, alpha = 0.15))
}
points(sim$x, sim$y, col = "blue", pch = 16)
legend("topleft",
legend = c("Observed", "Prior Lines"),
col = c("blue", "red"),
lty = c(NA, 1),
pch = c(16, NA))
}
prior_check_lines()
Here are the same plots, using more "informative" priors.
alpha_prior <- function() rnorm(1, mean = 0, sd = 1)
beta_prior <- function() rnorm(1, mean = 3.5, sd = 1)
sig_prior <- function() abs(rnorm(1, mean = 0, sd = 0.5))
prior_params()
prior_check_dists()
prior_check_lines()
Helpful resources:
https://mc-stan.org/docs/stan-users-guide/posterior-prediction.html
https://mc-stan.org/docs/stan-users-guide/posterior-predictive-checks.html
https://mc-stan.org/docs/cmdstan-guide/generate_quantities_config.html